Abstract

Recently several authors have proposed stochastic models of the growth of the Web 
graph that give rise to power-law distributions. These models are based on the notion 
of preferential attachment leading to the "rich get richer" phenomenon. However, these 
models fail to explain several empirically observed distributions, because the exponent
they predict is not consistent with the data. To address this problem, we
extend the evolutionary model of the Web graph by including a non-preferential component,
and we view the stochastic process in terms of an urn transfer model. By making
this extension, we can now explain a wider variety of empirically discovered power-law 
distributions provided the exponent is greater than two. These include: the distribution 
of incoming links, the distribution of outgoing links, the distribution of pages in a Web 
site and the distribution of visitors to a Web site. A by-product of our results is a formal 
proof of the convergence of the standard stochastic model (first proposed by Simon). 

1 Introduction 

A power-law distribution is a function of the form

$$f(i) = C\, i^{-\tau},$$

where $C$ and $\tau$ are positive constants. Power-law distributions are scale-free in the sense that
if $i$ is rescaled by multiplying it by a constant, then $f(i)$ would still be proportional to $i^{-\tau}$.

Power-law distributions are abundant. Examples include Zipf's law [Rap82], which states that the
relative frequency of words in a text is inversely proportional to their rank, and Lotka's
law [Nic89], which is an inverse square law stating that the number of authors making $n$
contributions is proportional to $n^{-2}$. (We refer the reader to [Sch91] for more examples of
power-law distributions.)



Recently several researchers have detected power-law distributions in the Internet [FFF99]
and World-Wide-Web [BKM+00, DKM+01] topologies. In order to understand how these
power-law distributions emerge and how the Web has evolved and is evolving, several
researchers have recently been studying stochastic models of graphs which give rise to such
distributions. One particular power-law phenomenon that has attracted attention is the
distribution of incoming links to a Web page. This distribution is important, since a link from
Web page P to Web page Q can be viewed as a recommendation of page Q; thus Web pages
having more incoming links are more highly recommended and therefore potentially of higher
quality. This observation is the basis of Google's PageRank algorithm [Hen01].

Albert et al. [ABJ00] studied a stochastic model of growth and preferential attachment,
where new links to existing Web pages are added in proportion to the number of incoming
links these Web pages already have. Their theoretical model predicts an exponent $\tau = 3$,
which is not in agreement with the value of approximately 2.1 obtained from the study
reported in [BKM+00]. Dorogovtsev et al. [DMS00a] generalise Albert et al.'s model and
predict an exponent greater than two. More precisely, they obtain the value $2 + A/m$ for
the exponent, where $A$ is the initial attractiveness of a newly created Web page and $m$ is the
number of new links added to the Web graph at each step of the stochastic process. This
exponent value is consistent with the empirical value of the exponent of the distribution of
incoming links provided $A/m$ is sufficiently small. Bornholdt and Ebel [BE00] pointed out
that the stochastic process proposed by Simon [Sim55] in 1955 can also offer an explanation
of the power-law distribution. (We note that during the period 1959-1961 there was a fierce
debate between Mandelbrot and Simon in Information and Control on the validity of Simon's
model [Man59].) In reply to Bornholdt and Ebel, Dorogovtsev et al. [DMS00b] note that the
model they describe in [DMS00a] essentially coincides with Simon's model.



The models discussed above are based on the process of preferential attachment and do 
not take into account the fact that links may also be added or removed randomly through a 
non-preferential process. By this we mean that the probability of adding or removing a link 
to a particular Web page may be influenced by factors other than the popularity of that Web 
page, where popularity is measured by the number of incoming links. Our main contribution 
in this paper is to extend Simon's model [Sim55] with a non-preferential component and view
the stochastic process in terms of an urn transfer model [JK77]. (We note that, at the end
of Section 3 of his seminal paper, Simon suggested adopting a mixture of preferential and 
non-preferential components but did not develop the idea.) By making this extension we 
can explain a wider variety of empirically discovered power-law distributions than can be 
explained with Simon's original model. These include: the distribution of incoming links, the 
distribution of outgoing links, the distribution of pages in a Web site and the distribution of 
visitors to a Web site. 

The rest of the paper is organised as follows. In Section 2 we present an urn transfer
model that generalises Simon's original model. In Section 3 we demonstrate how this can
provide a stochastic model for the evolution of the Web that is consistent with a wide range
of empirical data. Finally, in Section 4 we give our concluding remarks. The proofs of some of
the mathematical results are given in the Appendix. As far as we are aware, our convergence
proof given in the Appendix is the first formal proof validating Simon's model; it does not
rely on the mean-field theory approach, as for example in [BAJ99].



2 An Urn Transfer Model 



We now present an urn transfer model [JK77] for a stochastic process that we will use in
Section 3 to analyse the evolution of the Web graph. Our model is an extension of Simon's
stochastic process [Sim55], which was originally described in terms of the underlying process
leading to the distribution of words in a piece of text. Simon's stochastic process is essentially
a birth process, where there is a constant probability $p$ that the next word is a new word and,
given that the next word has already appeared, its probability of occurrence is proportional
to the previous number of occurrences of that word. We extend Simon's model by setting
the probability of occurrence of a word, given that it has already appeared, to be a weighted
average of the preferential probability, as described above, and the uniform probability that
would apply if all words were equiprobable. As we noted in the introduction, this extension was
already proposed by Simon at the end of Section 3 of his paper. Simon set out this extension in
equation (3.7) and presented a tentative solution in equation (3.8). As we will see in Section 3,
our extension of Simon's model makes sense in the context of the Web, since, for example, 
links to a Web site are often added or removed in a random fashion without taking into 
consideration the "attractiveness" of the site in terms of how many links it already has. In 
our urn model, urns contain balls which have pins attached to them. For example, balls 
could represent Web pages and pins could represent inlinks or outlinks. An urn would then 
correspond to the set of Web pages having a specific number of links. We now describe our 
urn model in detail. 

We assume a countable number of urns, $urn_i$, $i = 1, 2, 3, \ldots$, where each ball in $urn_i$ has $i$
pins attached to it. Initially, at stage $k = 1$ of the stochastic process, all the urns are empty
except $urn_1$, which has one ball in it. Let $F_i(k)$ be the number of balls in $urn_i$ after $k$ steps
of the stochastic process, so $F_1(1) = 1$, and let $p$ and $\alpha$ be parameters, with $0 < p < 1$ and
$\alpha > -1$.* Then, at stage $k + 1$ of the stochastic process for $k \ge 1$, one of two things may
occur:

(i) with probability $p_{k+1}$, where

$$p_{k+1} = 1 - \frac{(1-p)\sum_{i=1}^{k}(i+\alpha)F_i(k)}{k(1+\alpha p) + \alpha(1-p)}, \tag{1}$$

a new ball is added to $urn_1$ (provided that $0 \le p_{k+1} \le 1$) or,

(ii) with probability $1 - p_{k+1}$, an urn is selected, $urn_i$ being chosen with probability

$$\frac{(1-p)(i+\alpha)F_i(k)}{k(1+\alpha p) + \alpha(1-p)} \tag{2}$$

for $1 \le i \le k$; then one ball from $urn_i$ is transferred to $urn_{i+1}$. This is equivalent to
attaching an additional pin to the ball chosen from $urn_i$ and moving it to its "correct"
urn. The probability (2) is a combination of a preferential component (proportional
to the number of pins in $urn_i$) and a non-preferential component (proportional to the
number of balls in $urn_i$). (We note that the denominator appearing in (1) and (2) has
been chosen so that $E(p_{k+1})$, the expected value of the probability of adding a ball, is $p$;
see (8) below.)



At each stage we either add a new ball with one pin or add a pin to an existing ball and
move the ball to the next urn up, so at stage $k$ the total number of pins is $k$, i.e.

$$\sum_{i=1}^{k} i F_i(k) = k.$$



* The reader should note that in [Sim55] the quantity $\alpha$ corresponds to our parameter $p$.

It is obvious that $F_i(k) = 0$ for any $i > k$. We call the above model the $p_k$-model.

Let $B(k) = \sum_{i=1}^{k} F_i(k)$, the total number of balls in all the urns. We can now simplify (1)
to

$$p_{k+1} = 1 - \frac{(1-p)(k + \alpha B(k))}{k(1+\alpha p) + \alpha(1-p)}. \tag{3}$$

Since it is clear from (3) that $p_{k+1} \le 1$, in order for $p_{k+1}$ to be well defined, we must have
$p_{k+1} \ge 0$ for $k \ge 1$, i.e.

$$(1-p)(k + \alpha B(k)) \le k(1+\alpha p) + \alpha(1-p). \tag{4}$$



In the Appendix we show that $p_{k+1}$ is always well defined (i.e. non-negative) for all $k$
when $p \ge 1/2$, but only if

$$\alpha \le \frac{p}{1-2p} \tag{5}$$

when $p < 1/2$. In the discussion at the end of Section 3 we suggest that, in practice, starting
from a typical initial configuration of balls in the urns, it is likely that $p_{k+1}$ will be well defined
for all $k$, even if (5) does not hold.
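
The stage-by-stage rule above can be sketched in Python. This is a minimal illustration under stated assumptions, not the simulation code used for the results reported later: the function names `p_next` and `simulate_pk_model`, the dictionary representation of the urns and the seed handling are all hypothetical. It computes $p_{k+1}$ from (3) and, given that a transfer takes place, picks $urn_i$ with weight $(i+\alpha)F_i(k)$, which is (2) conditioned on case (ii).

```python
import random


def p_next(k, balls, p, alpha):
    """Probability p_{k+1} of adding a new ball, as in equation (3)."""
    denom = k * (1 + alpha * p) + alpha * (1 - p)
    return 1 - (1 - p) * (k + alpha * balls) / denom


def simulate_pk_model(stages, p, alpha, seed=0):
    """Run the p_k-model up to stage `stages`.

    Returns a dict mapping i to F_i(stages), or None if some p_{k+1}
    turns out negative (ill-defined), in which case the caller may restart.
    """
    rng = random.Random(seed)
    urns = {1: 1}   # stage k = 1: a single ball in urn_1
    balls = 1       # B(1) = 1
    for k in range(1, stages):
        pk1 = p_next(k, balls, p, alpha)
        if pk1 < 0:
            return None
        if rng.random() < pk1:
            # case (i): a new ball with one pin goes into urn_1
            urns[1] = urns.get(1, 0) + 1
            balls += 1
        else:
            # case (ii): choose urn_i with weight (i + alpha) * F_i(k)
            indices = sorted(urns)
            weights = [(i + alpha) * urns[i] for i in indices]
            i = rng.choices(indices, weights=weights)[0]
            urns[i] -= 1
            if urns[i] == 0:
                del urns[i]
            urns[i + 1] = urns.get(i + 1, 0) + 1
    return urns
```

For example, `simulate_pk_model(2000, 0.125, -0.37)` returns urn counts whose pins sum to 2000, in line with the fact that the total number of pins at stage $k$ is $k$.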

We next make a small digression to explain our use of (2) rather than the more natural
definition of the probability as

$$\frac{(1-p)(i+\alpha)F_i(k)}{\sum_{j=1}^{k}(j+\alpha)F_j(k)} = \frac{(1-p)(i+\alpha)F_i(k)}{k + \alpha B(k)}. \tag{6}$$

In order to find a solution for the expected value of $F_i(k)$, we would need to take the
expected value of (6); this is problematic since the random variables $B(k)$ and $F_i(k)$ are not
independent and it is therefore not clear how to calculate the expectation of the right-hand
expression in (6). We observe that this problem does not arise in Simon's original model
[Sim55], since in this case we have $\alpha = 0$ and the denominator reduces to the constant $k$ in
both (2) and (6). In our case, when $\alpha$ is not necessarily zero, by using (2) instead of the more
natural (6), there is no problem in computing the expectation of $F_i(k)$, since the parameter
$p$ is a constant, allowing us to find the expected value of $F_i(k)$ by using the linearity of
expectations.

In the Appendix we prove the following results for the expectations of $B(k)$ and $p_k$ for
$k > 1$, namely

$$E(B(k)) = E\Big(\sum_{i=1}^{k} F_i(k)\Big) = 1 + (k-1)p \tag{7}$$

and

$$E(p_k) = p. \tag{8}$$

(We note that $E(B(1)) = B(1) = 1$.)

Thus, in terms of expectations (i.e. using a mean-field theory approach), it is possible to
describe the urn transfer model as a "more natural" stochastic process, where at each stage
$k$, for $k > 1$, either

(i) a new ball is inserted into $urn_1$ with probability $p$, or

(ii) with probability $1 - p$ an urn is chosen, the probability of choosing $urn_i$ being proportional
to $(i + \alpha)F_i(k)$, and then one ball from $urn_i$ is transferred to $urn_{i+1}$.

We stress that, since this model uses the expectations of the random variables $p_{k+1}$ rather
than the random variables themselves, it is only an approximation of our urn transfer model.
This model, which we call the $p$-model, is in fact the "more natural" model discussed above
that uses (6) instead of (2).
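
As a rough illustration of the $p$-model, the following Python sketch tracks each ball individually rather than the urn counts. Choosing a ball with weight $i + \alpha$, where $i$ is its current number of pins, selects $urn_i$ with probability proportional to $(i+\alpha)F_i(k)$, as in case (ii). The names `simulate_p_model` and `urn_counts` are hypothetical, and the quadratic running time makes this suitable only for small experiments.

```python
import random
from collections import Counter


def simulate_p_model(stages, p, alpha, seed=0):
    """Run the p-model up to stage `stages`; return the pin count of each ball."""
    rng = random.Random(seed)
    pins = [1]  # stage 1: one ball carrying one pin
    for _ in range(1, stages):
        if rng.random() < p:
            pins.append(1)  # case (i): new ball in urn_1
        else:
            # case (ii): a ball with i pins is chosen with weight i + alpha
            weights = [n + alpha for n in pins]
            j = rng.choices(range(len(pins)), weights=weights)[0]
            pins[j] += 1    # the ball moves from urn_i to urn_{i+1}
    return pins


def urn_counts(pins):
    """Convert per-ball pin counts into the urn occupancies F_i."""
    return Counter(pins)
```

Calling `urn_counts(simulate_p_model(1500, 0.125, -0.37))` gives the occupancies $F_i$ at stage 1500; since every stage adds exactly one pin, the pins always sum to the stage number.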

We note that we could modify the initial condition of our stochastic process so that, for
example, $urn_1$ would initially contain $\delta > 1$ balls instead of one, or more generally that a
finite number of urns would initially be non-empty with some prescribed number of balls in
each. As can be seen from the development of the model below, as $k$ tends to infinity, such a
change in the initial conditions will not have an effect on the asymptotic distribution of the
balls in the urns.

We call the transfer of a ball as a result of (ii) above a mixture of preferential and
non-preferential transfer. When $\alpha = 0$ the transfer is purely preferential; otherwise
non-preferential transfer plays a part in the process.

Following Simon [Sim55], we now state the equations for the $p_k$-model. For $i > 1$ (including
$i > k$), we have:

$$E_k(F_i(k+1)) = F_i(k) + \beta_k\big((i-1+\alpha)F_{i-1}(k) - (i+\alpha)F_i(k)\big), \tag{9}$$

where $E_k(F_i(k+1))$ is the expected value of $F_i(k+1)$ given the state of the model at stage
$k$, and

$$\beta_k = \frac{1-p}{k(1+\alpha p) + \alpha(1-p)}$$

is the normalising constant used in (2).

Equation (9) gives the expected number of balls in $urn_i$ as the previous number of balls in
that urn plus the difference between the probability of increasing the number of balls in $urn_i$,
which is equal to the probability of choosing $urn_{i-1}$ in step (ii) of our urn transfer model, and
the probability of decreasing the number of balls in $urn_i$, which is equal to the probability of
choosing $urn_i$.

In the boundary case, $i = 1$, we have

$$E_k(F_1(k+1)) = F_1(k) + p_{k+1} - \beta_k(1+\alpha)F_1(k), \tag{10}$$

which describes the expected number of balls in $urn_1$, which is the previous number of balls
in the first urn plus the difference between the probability of inserting a new ball in $urn_1$ and
the probability of transferring a ball from $urn_1$ to $urn_2$.

Now letting

$$\beta = \frac{1-p}{1+\alpha p},$$

we see that $k\beta_k \approx \beta$ for large $k$. In fact, for $k \ge 1$,

$$\beta - k\beta_k = \alpha\beta\beta_k. \tag{11}$$

Using the facts that $0 < p < 1$ and $\alpha > -1$, it is also easy to see that

$$0 < \beta < 1, \tag{12}$$

and, for $k \ge 1$,

$$0 < \beta_k < \frac{1}{k+\alpha}. \tag{13}$$
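
Identity (11) and the bounds (12) and (13) are easy to check numerically. The short Python sketch below does so for given parameter values; the function names `beta`, `beta_k` and `check_identities` are hypothetical labels for the quantities defined above, and any parameters with $0 < p < 1$ and $\alpha > -1$ may be supplied.

```python
def beta(p, alpha):
    """Limiting constant beta = (1 - p) / (1 + alpha * p)."""
    return (1 - p) / (1 + alpha * p)


def beta_k(k, p, alpha):
    """Normalising constant beta_k used in (2) and (9)."""
    return (1 - p) / (k * (1 + alpha * p) + alpha * (1 - p))


def check_identities(p, alpha, k):
    """Return True if (11), (12) and (13) hold numerically for these values."""
    b, bk = beta(p, alpha), beta_k(k, p, alpha)
    eq11 = abs((b - k * bk) - alpha * b * bk) < 1e-12
    eq12 = 0 < b < 1
    eq13 = 0 < bk < 1 / (k + alpha)
    return eq11 and eq12 and eq13
```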

Since the right-hand sides of (9) and (10) are linear in the random variables, using (8),
we may take expectations to obtain

$$E(F_i(k+1)) = E(F_i(k)) + \beta_k\big((i-1+\alpha)E(F_{i-1}(k)) - (i+\alpha)E(F_i(k))\big) \tag{14}$$

for $i > 1$, and

$$E(F_1(k+1)) = E(F_1(k)) + p - \beta_k(1+\alpha)E(F_1(k)). \tag{15}$$

In order to solve equations (14) and (15) we show that $E(F_i(k))/k$ tends to a limit $f_i$
as $k$ tends to infinity. Assume for the moment that this is the case; then, letting $k$ tend to
infinity, $E(F_i(k+1)) - E(F_i(k))$ tends to $f_i$ and $\beta_k E(F_i(k))$ tends to $\beta f_i$; so (14) and (15)
yield

$$f_i = \beta\big((i-1+\alpha)f_{i-1} - (i+\alpha)f_i\big) \tag{16}$$

for $i > 1$, and

$$f_1 = p - \beta(1+\alpha)f_1. \tag{17}$$



Now let us define $f_i$, $i \ge 1$, by the recurrence relation (16) with boundary condition (17).
We may then let

$$E(F_i(k)) = k(f_i + \epsilon_{i,k}), \tag{18}$$

and in the Appendix we prove that $\epsilon_{i,k}$ tends to zero as $k$ tends to infinity. This justifies our
assumption that $E(F_i(k))/k$ tends to $f_i$ as $k$ tends to infinity. We therefore see that $f_i$ is the
asymptotic expected rate of increase of the number of balls in $urn_i$.

From (16) and (17) we obtain

$$f_i = \frac{\beta(i-1+\alpha)}{1+\beta(i+\alpha)}\, f_{i-1} \tag{19}$$

and

$$f_1 = \frac{p}{1+\beta(1+\alpha)} \tag{20}$$

respectively.

Now, on repeatedly using (19), we get

$$f_i = \frac{\rho p\,(1+\alpha)(2+\alpha)\cdots(i-1+\alpha)}{(1+\rho+\alpha)(2+\rho+\alpha)\cdots(i+\rho+\alpha)} = \frac{\rho p\,\Gamma(i+\alpha)\,\Gamma(1+\rho+\alpha)}{\Gamma(1+\alpha)\,\Gamma(i+1+\rho+\alpha)}, \tag{21}$$

where $\rho = 1/\beta$ and $\Gamma$ is the gamma function [AS72, 6.1].

It follows that for large $i$, on using Stirling's approximation [AS72, 6.1.39], we have

$$f_i \sim C\, i^{-(1+\rho)}, \tag{22}$$

where $C$ is independent of $i$ and $\sim$ means is asymptotic to. Thus we have derived in (22) a
general power-law distribution for $f_i$, with exponent $1 + \rho$. An obvious consequence of (19)
is that $f_i > f_{i+1}$, i.e. asymptotically there are more balls in $urn_i$ than in $urn_{i+1}$.
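
As a numerical cross-check of (19) to (22), the Python sketch below evaluates $f_i$ both by the recurrence and by the gamma-function form, and estimates the local log-log slope for large $i$. It is only an illustration: the function names are hypothetical, and `math.lgamma` is used in place of $\Gamma$ to avoid overflow.

```python
from math import exp, lgamma, log


def f_by_recurrence(n, p, alpha):
    """Return [f_1, ..., f_n] from the boundary value (20) and recurrence (19)."""
    b = (1 - p) / (1 + alpha * p)
    f = [p / (1 + b * (1 + alpha))]
    for i in range(2, n + 1):
        f.append(b * (i - 1 + alpha) / (1 + b * (i + alpha)) * f[-1])
    return f


def f_closed_form(i, p, alpha):
    """Evaluate f_i from the gamma-function expression in (21)."""
    rho = (1 + alpha * p) / (1 - p)  # rho = 1 / beta
    log_ratio = (lgamma(i + alpha) + lgamma(1 + rho + alpha)
                 - lgamma(1 + alpha) - lgamma(i + 1 + rho + alpha))
    return rho * p * exp(log_ratio)


def local_slope(f, i):
    """Slope of log f against log i between i - 1 and i (f holds f_1, f_2, ...)."""
    return log(f[i - 1] / f[i - 2]) / log(i / (i - 1))
```

With $p = 0.125$ and $\alpha = -0.37$, for instance, the two evaluations agree and the local slope at $i = 1000$ is close to $-(1+\rho) = -2.09$.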

It follows from (6), (7) and (8) that (16) and (17) will also hold for the asymptotic
distribution for the $p$-model obtained using the mean-field theory approach. Hence, on the
assumption that this approach is valid, the asymptotic distribution will be the same as for
the $p_k$-model, as given by (21) and (22).

When $\alpha = 0$ the extended model reduces to Simon's original model, and increasing
$\alpha$ increases the exponent accordingly. In any case the exponent is always greater than 2,
so the expected number of pins per ball is finite. The constraint that $\rho > 1$ is equivalent to
the condition that $\alpha > -1$. Another way to understand this constraint is that if $\alpha \le -1$ then
the first urn will never be chosen in case (ii) of the stochastic process, and thus no ball will
ever be transferred out of $urn_1$. When $\rho$ is close to 1 we obtain Lotka's law [Nic89], which
is an inverse square power-law; see also Price's cumulative advantage distribution leading to
Lotka's law [KH95].

In many real situations, such as the Web, $p$ is generally small. For example, if we interpret
balls as Web pages and the number of pins attached to a ball as the number of links incoming
to that Web page, then we expect the ratio of pages to links to be quite small, say 0.1, and
thus the exponent of the power-law to be just over two. If the value of $p$ and the power-law
exponent are obtained from empirical evidence, we may find a discrepancy from Simon's
original model, i.e. when $\alpha = 0$. Our current extension can explain this discrepancy through
the non-preferential component as long as the exponent is greater than two.



3 A Model for the Evolution of the Web 

We now describe a discrete stochastic process by which the Web graph could evolve. At each
time step the state of the Web graph is a directed graph $G = (N, E)$, where $N$ is its node set
and $E$ is its link set. In this case $F_i(k)$, $i \ge 1$, is the number of nodes in the Web graph having
$i$ incoming links; $F_i(k)$ induces an equivalence class of nodes in $N$ all having $i$ incoming links.
We note that although we have chosen $i$ to denote the number of incoming links, $i$ could
alternatively denote the number of outgoing links, the number of pages in a Web site or any
other reasonable parameter.

Consider the evolution process of the Web graph with respect to the number of nodes
having $i$ incoming links at the $k$th step of the process. Initially $G$ contains just a single node.
At each step one of two things can happen. With probability $p$ a new node is added to $G$
having one incoming link. In the $p$-model, this is equivalent to placing a new ball in $urn_1$.
With probability $1 - p$ a node is chosen to receive a new incoming link, with the probability
of choosing a given node being proportional to $(i + \alpha)$, where $i$ is the number of incoming
links the node currently has. In the $p$-model this is equivalent to a mixture of preferential
and non-preferential transfer of a ball from $urn_i$ to $urn_{i+1}$; the mixture level depends on the
value of the parameter $\alpha$.

When $\alpha = 0$ a node is chosen according to preferential attachment, i.e. in proportion to
the number of inlinks to that node. In this case the number of inlinks to a Web page could
be interpreted as a measure of how important or recommended the Web page is. A situation
when $\alpha > 0$ might occur if there is a choice of Web pages to link to and the actual decision of
which links are put in place has a random component. For example, if we were to add several
links to Web pages pertaining to Zipf's law to our Web site, we might randomly choose them
from a resource containing hundreds of such links. A situation when $\alpha < 0$ might occur if we
consider internal links within Web pages not to be relevant when measuring the distribution
of inlinks to Web pages. The justification for this view is that internal links do not contribute
to the external "visibility" of a Web page.

We now look at some of the measurements of the Web graph which were reported recently.
Broder et al. [BKM+00] reported a power-law distribution with exponent 2.09 for the number
of incoming links (referred to as inlinks) to a node. This value was derived from a 203 million
node crawl of the Web graph. The average number of inlinks per Web page was measured at
about 8 [KRRT99], which gives us a value of 0.125 for $p$. We can compute $\alpha$ by

$$\alpha = \frac{\rho(1-p) - 1}{p}.$$

Thus a more accurate model of the stochastic process generating the distribution of
incoming links would assume $\alpha \approx -0.37$ rather than $\alpha = 0$. (It would not be unreasonable in
this case to assume Simon's model, i.e. $\alpha = 0$, which would give an exponent of 2.14, since
the small difference in the exponents may be due to statistical error.)

Looking at the outgoing links (referred to as outlinks) from a node, Broder et al. [BKM+00]
reported a power-law distribution with exponent 2.72. Moreover, the average number of
outlinks per Web page was measured at about 7.2 [KRR+00], which gives a value of 0.14 for
$p$. Thus to get an exponent of 2.72 we would have to assume $\alpha \approx 3.42$. However, Simon's
original model would predict an exponent of about 2.16 for outlinks, similar to that for inlinks.
The positive value of $\alpha$ may have arisen because outlinks are often created
for reasons other than preferential attachment, for example, in order to maintain the local
structure of a Web site.

Another interpretation of $i$ is the number of pages within Web sites (referred to as
webpages). In this case, Huberman and Adamic [HA99] reported a power-law distribution with
exponent 1.85, derived from a 250,000 Web site crawl. Our model cannot explain this
observation as the exponent is less than two. A more recent result from a private communication
with Adamic reported an exponent of 2.2, derived from a 1.6 million Web site crawl; the
difference is possibly due to a different crawling strategy. To calculate $p$ we can estimate the
size of the Web to be 2.1 billion pages [MM00] distributed over approximately 113.5 million
Web sites (this number, which was reported on www.netsizer.com during the first quarter of
2001, refers to the number of Internet hosts, so it is an over-estimate of the number of Web
sites). Thus we can derive a value 0.054 for $p$; in reality $p$ will be even closer to zero. To get
an exponent of 2.2 we would have to assume $\alpha \approx 2.50$. This gives a more accurate description
than we would obtain from Simon's original model, which would predict an exponent of 2.06.
The positive value of $\alpha$ may have arisen because pages in a Web site may be
created in different ways; for example, pages may be created dynamically by a content
management system. This may tend to increase the number of pages by adding certain generated
pages.

As a final interpretation, let $i$ be the number of users visiting a Web site during the
course of a day (referred to as visitors). In this case, Adamic and Huberman [AH00] reported
a power-law distribution with exponent 2.07, derived from access logs of 60,000 AOL users
accessing 120,000 Web sites. Now, from www.netsizer.com we obtain the statistic that in the
USA there were 72.7 million Internet hosts and 166.6 million users at the beginning of 2001.
Moreover, from www.netvalue.com we obtain the statistic that, on average, users visit about
1.93 different Web sites per day. So, we derive the value $72.7/(166.6 \times 1.93) \approx 0.226$ for $p$, on
the assumption that each Web site gets visited at least once per day. Thus to get an exponent
of 2.07 we would have to assume $\alpha \approx -0.76$. However, Simon's original model would predict
an exponent of about 2.29. The negative value of $\alpha$ may have arisen because
some visitors to a Web site may tend to avoid well-trodden sites which may have too much
commercial content.
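
The values of $\alpha$ quoted above follow from inverting the exponent $1 + \rho$, where $\rho = (1+\alpha p)/(1-p)$. The Python sketch below reproduces them from the rounded values of $p$ given in the text; the function names `exponent` and `alpha_for` and the `CASES` table are hypothetical.

```python
def exponent(p, alpha):
    """Power-law exponent 1 + rho, with rho = (1 + alpha * p) / (1 - p)."""
    return 1 + (1 + alpha * p) / (1 - p)


def alpha_for(p, target_exponent):
    """Value of alpha that yields the target exponent for a given p."""
    rho = target_exponent - 1
    return (rho * (1 - p) - 1) / p


# (interpretation, p, empirical exponent) as quoted in the text
CASES = [
    ("inlinks", 0.125, 2.09),
    ("outlinks", 0.14, 2.72),
    ("webpages", 0.054, 2.2),
    ("visitors", 0.226, 2.07),
]

for name, p, tau in CASES:
    a = alpha_for(p, tau)
    print(f"{name}: alpha = {a:.2f}, Simon's exponent = {exponent(p, 0):.2f}")
```

Setting `alpha` to zero in `exponent` gives the prediction of Simon's original model for the same $p$, which is how the comparison values 2.14, 2.16, 2.06 and 2.29 arise.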

In order to validate our model, we programmed a simulation of the stochastic model
using the parameter values we have derived for $p$ and $\alpha$ and compared the exponent values
obtained with the reported empirical values. (Our simulation is in the spirit of Simon and
Van Wormer's Monte Carlo simulation, whose intention was to test how good the estimates
of the original model are [SV63].) We repeated the simulation five times using the $p_k$-model,
and five times using the $p$-model. Each simulation was carried out for 200,000 iterations,
and for the purpose of regression we considered only the first 25 urns. The results of our
simulations are presented in Table 1; in all cases the correlation coefficient of the regression
analysis was close to one. The discrepancy between the simulated values and the empirical
values can be attributed in part to the fact that (22) is only an asymptotic approximation to
(21). It is also possible that running the simulations for a much larger number of iterations
would give more accurate results.
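
The regression step can be illustrated as follows. The Python sketch fits a least-squares line to $\log f_i$ against $\log i$ over the first 25 urns, using the exact values from (21) in place of simulated counts. That substitution is an assumption made purely for illustration; the names `loglog_slope`, `f_closed_form` and `fitted_exponent` are hypothetical, and the sketch is not the regression code used to produce Table 1.

```python
from math import exp, lgamma, log


def f_closed_form(i, p, alpha):
    """Evaluate f_i from the gamma-function expression in (21)."""
    rho = (1 + alpha * p) / (1 - p)
    log_ratio = (lgamma(i + alpha) + lgamma(1 + rho + alpha)
                 - lgamma(1 + alpha) - lgamma(i + 1 + rho + alpha))
    return rho * p * exp(log_ratio)


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


def fitted_exponent(p, alpha, urns=25):
    """Exponent estimated by regression over urn_1, ..., urn_urns."""
    idx = range(1, urns + 1)
    return -loglog_slope(idx, [f_closed_form(i, p, alpha) for i in idx])
```

Because (22) is only asymptotic, a fit restricted to the first 25 urns need not return exactly $1 + \rho$, even when applied to the exact $f_i$.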



Interpretation   Empirical   p_k-model   p-model
inlinks          2.09        2.096       2.094
outlinks         2.72        2.714       2.675
webpages         2.2         2.122       2.208
visitors         2.07        2.131       2.179

Table 1: Power law exponents of simulation results



For outlinks and webpages we restarted the $p_k$-model simulation whenever the computed
value of $p_{k+1}$ was ill-defined, i.e. negative; only a moderate number of restarts were necessary.
From (3) it can be shown that, for $k > 1$, $|p_{k+1} - p_k|$ is of the order of $1/k$. This indicates
that for large $k$ it is very unlikely that $p_{k+1}$ will be ill-defined, given that $p_j$ is well defined
for $j \le k$. In practice, if instead of starting with just one ball in $urn_1$, we start from a typical
initial configuration with a modest number of balls in the urns, it is likely that $p_{k+1}$ will be
well defined for all $k$.

To illustrate this point, let us now examine more closely the situation regarding restarts
for outlinks, rounding off $p$ to be 0.15 and $\alpha$ to be 3.5. It can be verified that the probability
that $p_3$ be ill-defined is 0.15, that $p_4$ be ill-defined is about 0.1905, that $p_5$ be ill-defined is
about 0.1769 and that $p_6$ be ill-defined is 0. Thus the total probability of $p_k$ being ill-defined
for $k \le 6$ is about 0.5174. Table 2 shows the values of a simulation where the $p_k$-model was
run 1000 times in batches of 100 runs each. Whenever $p_k$ was ill-defined for a given run,
this run was considered to be a restart and the simulation moved on to the next run. The
second column shows the numbers of restarts within the batch, the third column shows the
percentage of the restarts observed with $p_k$ ill-defined for $k < 10$, the fourth column shows
the average stage at which the restarts became ill-defined and the fifth column shows the
maximum stage at which any restart became ill-defined. Thus, if the process is well defined
for, say, 50 or more stages, it is very unlikely to become ill-defined at a later stage.

Batch     Overall   k < 10   Average k   Max k
1         66        89%      3.78        21
2         63        90%      3.86        26
3         63        90%      3.34        13
4         60        88%      3.68        30
5         63        90%      3.73        22
6         65        92%      3.49        22
7         61        92%      3.50        17
8         64        94%      3.54        22
9         64        86%      4.19        34
10        59        92%      3.49        21
Average   63        90%      3.66        23

Table 2: Statistics for restarts, with p = 0.15 and alpha = 3.5
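
The early-stage restart probabilities quoted above can be checked by exact enumeration, since the probability of adding a ball at stage $k+1$ depends on the state only through $k$ and the total number of balls $B(k)$. The Python sketch below propagates the distribution of $B(k)$ over the runs that are still well defined; the function name `ill_defined_probs` is hypothetical.

```python
def ill_defined_probs(p, alpha, max_stage):
    """Map each stage s in 2..max_stage to the probability that p_s is the
    first ill-defined (negative) value in a run of the p_k-model."""
    dist = {1: 1.0}   # distribution of B(k) over well-defined runs, k = 1
    probs = {}
    for k in range(1, max_stage):
        denom = k * (1 + alpha * p) + alpha * (1 - p)
        nxt = {}
        bad = 0.0
        for b, w in dist.items():
            pk1 = 1 - (1 - p) * (k + alpha * b) / denom
            if pk1 < 0:
                bad += w          # this run would be restarted at stage k + 1
                continue
            nxt[b + 1] = nxt.get(b + 1, 0.0) + w * pk1
            nxt[b] = nxt.get(b, 0.0) + w * (1 - pk1)
        probs[k + 1] = bad
        dist = nxt
    return probs
```

With $p = 0.15$ and $\alpha = 3.5$, `ill_defined_probs(0.15, 3.5, 6)` gives approximately 0.15, 0.1905, 0.1769 and 0 for stages 3 to 6, in agreement with the figures in the text.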

4 Concluding Remarks 

We have extended Simon's classical stochastic model by adding to it a non-preferential
component which is combined with preferential attachment. When viewing this stochastic process
in terms of an urn transfer model, this amounts to choosing balls proportional to the number
of times they have previously been selected (i.e. the number of pins) plus a constant $\alpha > -1$.
From the equations of this process we derived an asymptotic formula for the exponent of the 
resulting power-law distribution. As far as we are aware our proof given in the Appendix is 
the first formal proof of the convergence of Simon's model; unlike in previous work, we do 
not rely on the mean-field theory approach. 

Utilising our result we are able to explain several power-law distributions in the Web graph, 
which we now summarise. For the distribution of incoming links we derived $\alpha \approx -0.37$, for
the distribution of outgoing links we derived $\alpha \approx 3.42$, for the distribution of pages in a
Web site we derived $\alpha \approx 2.50$ and, finally, for the distribution of visitors to a Web site we
derived $\alpha \approx -0.76$. In all cases our extension of Simon's original model can better explain the
exponent of the power-law distribution, indicating that there is some mixture of preferential 
and non-preferential attachment in the selection process. 

The power law distribution that we have established can be stated as a hypothesis: in 
order to explain the evolution of the Web graph both preferential and non-preferential processes 
are at work. This hypothesis is more consistent with empirical data than the one which utilises
only preferential attachment. Our model is still limited to the cases where the exponent of the
power-law distribution is greater than two. We are currently investigating a possible model 
which could yield an exponent less than two. 

A Appendix: Proofs

We first prove (7) and (8). Since at stage $k + 1$ we add a new ball with probability $p_{k+1}$,

$$E_k(B(k+1)) = B(k) + p_{k+1},$$

so, taking expectations,

$$E(B(k+1)) = E(B(k)) + E(p_{k+1}). \tag{23}$$

Lemma A.1 For $k > 1$,

$$E(B(k)) = E\Big(\sum_{i=1}^{k} F_i(k)\Big) = 1 + (k-1)p \tag{24}$$

and

$$E(p_k) = p. \tag{25}$$

Proof. We prove the result by induction on $k$. For $k = 2$, remembering that $B(1) = 1$ and
using (3), it is easy to see that $p_2 = p$, and thus, by using (23), that

$$E(B(2)) = 1 + E(p_2) = 1 + p.$$

Now assume that (24) and (25) hold for some $k$, $k > 1$. Then,

$$E(p_{k+1}) = 1 - \frac{(1-p)(k + \alpha E(B(k)))}{k(1+\alpha p) + \alpha(1-p)} = 1 - \frac{(1-p)(k + \alpha(1 + (k-1)p))}{k(1+\alpha p) + \alpha(1-p)} = p,$$

and thus, using (23),

$$E(B(k+1)) = 1 + (k-1)p + p = 1 + kp. \qquad \Box$$
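
Lemma A.1 can be checked numerically. The Python sketch below computes $E(B(k))$ exactly by propagating the distribution of $B(k)$ with the probability of adding a ball taken from (3). It is intended for parameter values with $-1 < \alpha < 0$, for which every such probability lies between 0 and 1; the function name `expected_balls` is hypothetical.

```python
def expected_balls(p, alpha, stage):
    """Exact E(B(stage)) for the p_k-model, starting from B(1) = 1."""
    dist = {1: 1.0}  # probability distribution of B(k), initially k = 1
    for k in range(1, stage):
        denom = k * (1 + alpha * p) + alpha * (1 - p)
        nxt = {}
        for b, w in dist.items():
            pk1 = 1 - (1 - p) * (k + alpha * b) / denom  # equation (3)
            nxt[b + 1] = nxt.get(b + 1, 0.0) + w * pk1
            nxt[b] = nxt.get(b, 0.0) + w * (1 - pk1)
        dist = nxt
    return sum(b * w for b, w in dist.items())
```

For example, with $p = 0.3$ and $\alpha = -0.5$ the value returned for stage 50 agrees with $1 + 49p = 15.7$ up to rounding error.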

We now consider condition (4) needed for $p_{k+1}$ to be well defined.

Lemma A.2 $p_{k+1}$ is always well defined (i.e. non-negative) for all $k$ when $p \ge 1/2$, but only
if

$$\alpha \le \frac{p}{1-2p}$$

when $p < 1/2$.

Proof. In order that $p_{k+1} \ge 0$, condition (4) must hold. This is equivalent to

$$\alpha(B(k) - 1) \le \frac{kp(1+\alpha)}{1-p}. \tag{26}$$

There are three cases to consider:

(I) When $\alpha = 0$, there are no restrictions on $p$.

(II) When $-1 < \alpha < 0$, it is straightforward to see that again there are no restrictions on
$p$ since, in this case, the maximum value of the left-hand side of (26) is zero, when
$B(k) = 1$.

(III) When $\alpha > 0$, we see from (26) that we must have

$$p \ge \frac{\alpha(B(k)-1)}{\alpha(B(k)-1) + k(1+\alpha)}.$$

Setting $B(k)$ to its maximum value $k$, this requires that

$$p \ge \frac{\alpha(k-1)}{\alpha(k-1) + k(1+\alpha)}, \tag{27}$$

which will hold for all $k$ provided

$$p \ge \frac{\alpha}{2\alpha+1};$$

in particular this holds for all $\alpha$ when $p \ge 1/2$. However, for $p < 1/2$, for (27) to hold for all
$k$ we need

$$\alpha \le \frac{p}{1-2p}. \qquad \Box$$
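
The condition of Lemma A.2 is simple to apply to the parameter values derived in Section 3. The Python sketch below encodes the three cases of the proof in a hypothetical helper, `always_well_defined`; it reports whether non-negativity of $p_{k+1}$ is guaranteed for every configuration of balls, not whether a particular run will fail.

```python
def always_well_defined(p, alpha):
    """True if p_{k+1} >= 0 is guaranteed for all k by Lemma A.2."""
    if alpha <= 0:          # cases (I) and (II): no restriction on p
        return True
    if p >= 0.5:            # case (III) with p >= 1/2
        return True
    return alpha <= p / (1 - 2 * p)   # case (III) with p < 1/2
```

For the values of Section 3 it returns `False` for outlinks ($p = 0.14$, $\alpha \approx 3.42$) and webpages ($p = 0.054$, $\alpha \approx 2.50$), the two cases in which restarts were needed.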



We conclude by proving that as $k$ tends to infinity $E(F_i(k))/k$ tends to $f_i$, justifying our
derivation of the asymptotic distribution of the balls in the urns. We first state some useful
properties of $f_i$, which may be verified directly using (16) and (17).

Lemma A.3

(I) For all $i \ge 1$, $0 < f_i < 1$ and $f_i > f_{i+1}$.
(II) $\sum_{i=1}^{\infty} f_i = p$ and $\sum_{i=1}^{\infty} i f_i = 1$. $\Box$
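
Part (II) of Lemma A.3 can be checked numerically by summing the recurrence (19). In the Python sketch below, the parameters $p = 1/2$, $\alpha = 0$ used in the usage note are an illustrative choice for which $\rho = 2$ and $f_i = 2/(i(i+1)(i+2))$, so the partial sums converge quickly; the function name `partial_sums` is hypothetical.

```python
def partial_sums(p, alpha, n):
    """Return (sum of f_i, sum of i * f_i) over i = 1..n, using (19) and (20)."""
    b = (1 - p) / (1 + alpha * p)
    f = p / (1 + b * (1 + alpha))   # f_1 from (20)
    total, weighted = f, f
    for i in range(2, n + 1):
        f *= b * (i - 1 + alpha) / (1 + b * (i + alpha))  # recurrence (19)
        total += f
        weighted += i * f
    return total, weighted
```

Calling `partial_sums(0.5, 0.0, 100000)` returns values close to $p = 0.5$ and 1 respectively; for parameters with $\rho$ close to 1 the second sum converges much more slowly.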



Theorem A. 4 For all i > 1, 



fe^oo K 
Proof. Using (18) to rewrite the recurrence equations for the expected values $E(F_i(k+1))$, we obtain, for $i > 1$,

$$(k+1)(f_i + \epsilon_{i,k+1}) = k(f_i + \epsilon_{i,k}) + k\beta_k(i-1+\alpha)(f_{i-1} + \epsilon_{i-1,k}) - k\beta_k(i+\alpha)(f_i + \epsilon_{i,k}) \qquad (28)$$

and, for $i = 1$,

$$(k+1)(f_1 + \epsilon_{1,k+1}) = k(f_1 + \epsilon_{1,k}) - k\beta_k(1+\alpha)(f_1 + \epsilon_{1,k}) + p. \qquad (29)$$

Equations (16) and (17) may be written in a similar form as

$$(k+1)f_i = kf_i + \beta(i-1+\alpha)f_{i-1} - \beta(i+\alpha)f_i \qquad (30)$$

and

$$(k+1)f_1 = kf_1 - \beta(1+\alpha)f_1 + p. \qquad (31)$$



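For $i > 1$, subtracting (30) from (28) yields:

$$(k+1)\epsilon_{i,k+1} = k\epsilon_{i,k} + k\beta_k(i-1+\alpha)\epsilon_{i-1,k} - k\beta_k(i+\alpha)\epsilon_{i,k} + (k\beta_k - \beta)\bigl((i-1+\alpha)f_{i-1} - (i+\alpha)f_i\bigr).$$

Using (16) and (11) this simplifies to

$$(k+1)\epsilon_{i,k+1} = (1 - \beta_k(i+\alpha))k\epsilon_{i,k} + \beta_k(i-1+\alpha)k\epsilon_{i-1,k} - \alpha\beta_k f_i. \qquad (32)$$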
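The simplification to (32) can be spelled out as a short worked step. It assumes that (11) amounts to $\beta_k = \beta/(k + \alpha\beta)$, which is consistent with the denominators $k + \alpha\beta$ that appear later when (11) is used to substitute for $\beta_k$; this reading is an inference, not a quotation of (11).

```latex
\begin{align*}
k\beta_k - \beta
  &= \frac{k\beta}{k + \alpha\beta} - \beta
   = -\frac{\alpha\beta^{2}}{k + \alpha\beta}
   = -\alpha\beta\beta_k, \\
\beta\bigl((i-1+\alpha)f_{i-1} - (i+\alpha)f_i\bigr)
  &= f_i \qquad \text{(rearranging (30))}, \\
(k\beta_k - \beta)\bigl((i-1+\alpha)f_{i-1} - (i+\alpha)f_i\bigr)
  &= -\alpha\beta\beta_k \cdot \frac{f_i}{\beta}
   = -\alpha\beta_k f_i .
\end{align*}
```

The last line is the term $-\alpha\beta_k f_i$ appearing in (32); the remaining terms are obtained by collecting the coefficients of $\epsilon_{i,k}$ and $\epsilon_{i-1,k}$.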



Similarly, for $i = 1$, subtracting (31) from (29) and using (11), we obtain:

$$(k+1)\epsilon_{1,k+1} = (1 - \beta_k(1+\alpha))k\epsilon_{1,k} + \alpha\beta\beta_k(1+\alpha)f_1. \qquad (33)$$



From (32), by virtue of (13) and the fact that $f_i < 1$, we have, for $1 < i \le k$,

$$(k+1)|\epsilon_{i,k+1}| \le (1 - \beta_k(i+\alpha))k|\epsilon_{i,k}| + \beta_k(i-1+\alpha)k|\epsilon_{i-1,k}| + |\alpha|\beta_k. \qquad (34)$$



From (18) it follows that $\epsilon_{i,k} = -f_i$ for $i > k$, so for $i = k+1$ equation (32) becomes

$$(k+1)\epsilon_{k+1,k+1} = \beta_k(k+\alpha)k\epsilon_{k,k} - f_{k+1}\bigl(k(1 - \beta_k(k+\alpha)) - k\beta_k + \alpha\beta_k\bigr). \qquad (35)$$

It follows that

$$(k+1)|\epsilon_{k+1,k+1}| \le \beta_k(k+\alpha)k|\epsilon_{k,k}| + f_{k+1}\bigl(k(1 - \beta_k(k+\alpha)) + k\beta_k + |\alpha|\beta_k\bigr).$$
Using (11) to substitute for $\beta_k$, this gives

$$\begin{aligned}
(k+1)|\epsilon_{k+1,k+1}| &\le \frac{\beta(k+\alpha)k|\epsilon_{k,k}|}{k + \alpha\beta} + \frac{f_{k+1}}{k + \alpha\beta}\bigl(k^2(1-\beta) + k\beta + |\alpha|\beta\bigr) \\
&\le \frac{\beta(k+\alpha)k|\epsilon_{k,k}|}{k + \alpha\beta} + \frac{k}{k + \alpha\beta}(1 + |\alpha|), \qquad (36)
\end{aligned}$$

since $k \ge 1$, $\beta < 1$, and $k f_{k+1} < 1$ by Lemma A.3(II). We now define

$$\delta_k = \max_{i \ge 1} |\epsilon_{i,k}| = \max_{1 \le i \le k+1} |\epsilon_{i,k}|. \qquad (37)$$

(The two maxima are equal since $\epsilon_{i,k} = -f_i$ for $i > k$, and $f_i$ is monotonic decreasing.)
On using (37), inequality (34) yields

$$(k+1)|\epsilon_{i,k+1}| \le (1 - \beta_k)k\delta_k + |\alpha|\beta_k \qquad (38)$$

for $1 < i \le k$.



Similarly, from (33), on using (37) together with (11), (13) and Lemma A.3(I), it follows
that

$$(k+1)|\epsilon_{1,k+1}| \le (1 - \beta_k(1+\alpha))k\delta_k + |\alpha|\beta_k(1+\alpha). \qquad (39)$$



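Now let

$$\gamma = \frac{1 + |\alpha|}{1 - \beta},$$

so again in a similar fashion from (36)

$$(k+1)|\epsilon_{k+1,k+1}| \le \frac{\beta(k+\alpha)k|\epsilon_{k,k}|}{k + \alpha\beta} + \frac{k\gamma(1-\beta)}{k + \alpha\beta}. \qquad (40)$$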



We show by induction on $k$ that

$$k\delta_k \le \gamma. \qquad (41)$$

From (18) and (37) we see that $\delta_1 = \max\{1 - f_1, f_2\}$. So, by Lemma A.3(I), (41) holds
for $k = 1$.



Now assume that (41) holds for some $k \ge 1$. So, for $1 < i \le k$, since $|\alpha| < \gamma$, (38) gives

$$(k+1)|\epsilon_{i,k+1}| \le (1 - \beta_k)\gamma + |\alpha|\beta_k \le \gamma. \qquad (42)$$

For $i = k+1$, from (40) using (37) and (41), we have

$$(k+1)|\epsilon_{k+1,k+1}| \le \frac{\beta(k+\alpha)\gamma}{k + \alpha\beta} + \frac{k\gamma(1-\beta)}{k + \alpha\beta} = \gamma. \qquad (43)$$
For $i > k+1$, since $\epsilon_{i,k+1} = -f_i$, by Lemma A.3(II),

$$(k+1)|\epsilon_{i,k+1}| = (k+1)f_i < 1 \le \gamma. \qquad (44)$$

Similarly, for $i = 1$, (39) gives

$$(k+1)|\epsilon_{1,k+1}| \le (1 - \beta_k(1+\alpha))\gamma + |\alpha|\beta_k(1+\alpha) \le \gamma. \qquad (45)$$

Therefore, from (42), (43), (44) and (45), $(k+1)\delta_{k+1} \le \gamma$.

So, by induction, (41) holds for all $k \ge 1$. Thus, to conclude the proof, we note that, as
$k$ tends to infinity, $\delta_k$ tends to 0, so $\epsilon_{i,k}$ tends to 0 for all $i$. □
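The bound $k\delta_k \le \gamma$ can be observed numerically by iterating the expectation recurrences behind (28) and (29) and comparing $E(F_i(k))/k$ with $f_i$. The sketch below is an illustration under stated assumptions, not the authors' procedure: it takes $\beta = (1-p)/(1+\alpha p)$, $\beta_k = \beta/(k+\alpha\beta)$, $\gamma = (1+|\alpha|)/(1-\beta)$ and the starting state $E(F_1(1)) = 1$, all inferred from this appendix; the function name and parameter values are hypothetical.

```python
def scaled_errors(p, alpha, k_max):
    """Return [k * delta_k for k = 1..k_max], where delta_k is the largest
    |E(F_i(k))/k - f_i| over 1 <= i <= k+1."""
    beta = (1 - p) / (1 + alpha * p)

    # Limiting values f_1 .. f_{k_max+1} from the rearranged (30) and (31).
    f = [p / (1 + beta * (1 + alpha))]
    for i in range(2, k_max + 2):
        f.append(f[-1] * beta * (i - 1 + alpha) / (1 + beta * (i + alpha)))

    # expected[i-1] holds E(F_i(k)); at stage k = 1 there is one ball in urn 1.
    expected = [1.0] + [0.0] * k_max
    result = []
    for k in range(1, k_max + 1):
        delta_k = max(abs(expected[j] / k - f[j]) for j in range(k + 1))
        result.append(k * delta_k)

        beta_k = beta / (k + alpha * beta)
        nxt = expected[:]
        # Urn 1 gains a new ball with probability p and loses transfers.
        nxt[0] = expected[0] - beta_k * (1 + alpha) * expected[0] + p
        # Urns 2 .. k+1 gain from the urn below and lose to the urn above.
        for i in range(2, k + 2):
            nxt[i - 1] = expected[i - 1] + beta_k * (
                (i - 1 + alpha) * expected[i - 2] - (i + alpha) * expected[i - 1]
            )
        expected = nxt
    return result


# Illustrative parameters (assumptions): p = 0.5, alpha = 0.5.
p, alpha = 0.5, 0.5
beta = (1 - p) / (1 + alpha * p)
gamma = (1 + abs(alpha)) / (1 - beta)
errors = scaled_errors(p, alpha, 300)
print(max(errors) <= gamma, errors[-1] / 300)
```

Since $k\delta_k$ stays bounded by $\gamma$, the error $\delta_k$ itself shrinks at least as fast as $\gamma/k$, which is the content of the final step of the proof.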



References 

[ABJ00] R. Albert, A.-L. Barabási, and H. Jeong. Scale-free characteristics of random
networks: the topology of the world-wide web. Physica A, 281:69-77, 2000.

[AH00] L.A. Adamic and B.A. Huberman. The nature of markets in the World Wide 
Web. Quarterly Journal of Electronic Commerce, 1:5-12, 2000. 

[AS72] M. Abramowitz and I.A. Stegun, editors. Handbook of Mathematical Functions
with Formulas, Graphs and Mathematical Tables. Dover, New York, NY, 1972.

[BAJ99] A.-L. Barabási, R. Albert, and H. Jeong. Mean-field theory for scale free random
networks. Physica A, 272:173-189, 1999.

[BE00] S. Bornholdt and H. Ebel. World-Wide Web scaling exponent from Simon's 1955 
model. Condensed Matter Archive, cond-mat/0008465, 2000. 

[BKM+00] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata,
A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks and
ISDN Systems, 30:309-320, 2000.

[DKM+01] S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins.
Self-similarity in the Web. In Proceedings of International Conference on Very
Large Data Bases, pages 69-78, Rome, 2001.

[DMS00a] S.N. Dorogovtsev, J.F.F. Mendes, and A.N. Samukhin. Structure of growing
networks with preferential linking. Physical Review Letters, 85:4633-4636, 2000.

[DMS00b] S.N. Dorogovtsev, J.F.F. Mendes, and A.N. Samukhin. WWW and internet models
from 1955 till our day and the "popularity is attractive" principle. Condensed
Matter Archive, cond-mat/0009090, 2000.

[FFF99] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of 
the internet topology. In Proceedings of SIGCOMM Conference on Applications, 
Technologies, Architectures, and Protocols for Computer Communication, pages 
251-262, Cambridge, Ma., 1999. 



[HA99] B.A. Huberman and L.A. Adamic. Evolutionary dynamics of the World Wide
Web. Nature, 399:131, 1999.



[Hen01] M.R. Henzinger. Hyperlink analysis for the web. IEEE Internet Computing,
1:45-50, 2001.

[JK77] N.L. Johnson and S. Kotz. Urn Models and their Application: An Approach 
to Modern Discrete Probability. Wiley Series in Probability and Mathematical 
Statistics. John Wiley & Sons, New York, NY, 1977. 

[KH95] M. Koenig and T. Harrell. Lotka's law, Price's urn, and electronic publishing. 
Journal of the American Society for Information Science, 46:386-388, 1995. 

[KRR+00] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal.
The web as a graph. In Proceedings of ACM Symposium on Principles of
Database Systems, pages 1-10, Dallas, Tx., 2000.

[KRRT99] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale 
knowledge bases from the web. In Proceedings of International Conference on 
Very Large Data Bases, pages 639-650, Edinburgh, 1999. 

[Man59] B. Mandelbrot. A note on a class of skew distribution functions: Analysis and 
critique of a paper by H.A. Simon. Information and Control, 2:90-99, 1959. 

[MM00] B.H. Murray and A. Moore. Sizing the internet. White paper, Cyveillance, July 
2000. 

[Nic89] P.T. Nicholls. Bibliometric modeling processes and the empirical validity of 
Lotka's law. Journal of the American Society of Information Science, 40:379- 
385, 1989. 

[Pri76] D. de Solla Price. A general theory of bibliometric and other cumulative advantage 
processes. Journal of the American Society of Information Science, 27:292-306, 
1976. 

[Rap82] A. Rapoport. Zipf's law revisited. In H. Guiter and M.V. Arapov, editors, Studies 
on Zipf's Law, pages 1-28. Studienverlag Brockmeyer, Bochum, Germany, 1982. 

[Sch91] M. Schroeder. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. 
W.H. Freeman, New York, NY, 1991. 

[Sim55] H.A. Simon. On a class of skew distribution functions. Biometrika, 42:425-440, 
1955. 

[SV63] H.A. Simon and T. Van Wormer. Some Monte Carlo estimates of the Yule
distribution. Behavioral Science, 8:203-210, 1963.






